Abstract

Gene function prediction from microarray data is a first step toward 
better understanding the machinery of the cell from relatively cheap and 
easy-to-produce data. In this paper we investigate whether the knowledge 
of many metabolic pathways and their catalyzing enzymes accumulated 
over the years can help improve the performance of classifiers for this 
problem. 

The complex network of known biochemical reactions in the cell results
in a representation where genes are nodes of a graph. Formulating the
problem as a graph-driven feature extraction problem, based on the
simple idea that relevant features are likely to exhibit correlation with
respect to the topology of the graph, we end up with an algorithm which
involves encoding the network and the set of expression profiles into
kernel functions, and performing a regularized form of canonical
correlation analysis in the corresponding reproducing kernel Hilbert spaces.

Function prediction experiments for the genes of the yeast S. cerevisiae
validate this approach by showing a consistent increase in performance
when a state-of-the-art classifier uses the vector of features instead
of the original expression profile to predict the functional class of a gene.

Keywords: microarray, gene expression, network, pathway, diffusion ker- 
nel, kernel CCA, feature extraction, function prediction. 

1 Introduction 

Following the near completion of many genome sequencing projects and the
identification of genes coding for proteins in these genomes, the research
paradigm is shifting toward a better understanding of the functions of the
genes and their interactions. This discipline, broadly called functional
genomics, is expected to provide new insights into the machinery of the cell
and suggest new therapeutic targets by better focusing on the precise
molecules or processes responsible for a given disease.

Functional genomics has been boosted since the mid 1990s by the introduction
of DNA microarray technology [SSDB95, BB00], which enables the monitoring of
the quantity of messenger RNA (mRNA) present in a cell for several thousand
genes simultaneously, at a given instant. As mRNA is the intermediate
molecule between the blueprint of a protein on the DNA strand and the protein
itself, it is expected that the quantity of mRNA reflects the quantity of the
protein itself, and that variations in the quantity of mRNA when a cell is
exposed to various experimental conditions reflect the genetic regulation
process. Consequently functional characterization of a protein from its
expression profile, as measured by several microarray hybridization
experiments, is supposed to be possible to some extent, and initial
experiments confirmed that many genes with similar function yield similar
expression patterns [ESBB98]. As data accumulate, the incentive to develop
precise methods to assign functions to genes from expression profiles
increases.

Proteins can have many structural or functional roles. In particular proteins
known as enzymes catalyze chemical reactions which enable cells to acquire
energy and materials from their environment, and to utilize them to maintain
their own biochemical network. Decades of careful experiments have helped
characterize many reactions taking place in the cell together with some of
the genes playing a role in their control, and this information has now been
integrated into several databases including WIT [OLP+00] or KEGG [KGKN02].
Such databases provide a view of the set of proteins as the nodes of a large
and complex network, where two genes are linked when they catalyze two
successive reactions.

The question motivating this paper is whether this network can help improve
the performance of function prediction algorithms based on microarray data
only. To this end we propose a graph-driven feature extraction process from
the expression profiles, based on the idea that patterns of expression which
correspond to actual biological events, such as the activation of a series of
chemical reactions forming a chemical pathway, are likely to be shared by
genes close to each other with respect to the network topology. Translating
this idea mathematically we end up with a feature extraction process
equivalent to performing a generalization of canonical correlation analysis
(CCA) between the representations of the genes in two different reproducing
kernel Hilbert spaces (RKHS), defined respectively by a diffusion kernel
[KL02] on the gene graph and by a linear kernel on the expression profiles.
The CCA can be performed in these RKHS using the kernel-CCA algorithm
presented in [BJ01].

Relationships between expression profiles and biochemical pathways have
been the subject of much investigation in recent years. As microarray data
are much cheaper to produce than precise pathway data, pathway reconstruction
or validation from expression data has been attracting much attention since
the availability of public microarray data [FLNP00, AMK00]. Extraction of
co-clusters, i.e., clusters of genes in the network which have similar
expression, has also been investigated recently [NGK01, HZZL02]. From a
technical point of view the integration of several sources of data has been
investigated with different approaches, e.g., combining expression data and
genomic location information in a Bayesian framework [HGJY02], combining
expression data with phylogenetic profiles by kernel operations [PWCG01], or
defining distances between genes by combining distances measured from
different data types [MPT+99].

This paper is organized as follows. Section 2 translates mathematically the
feature extraction problem and contains basic notations and definitions,
followed by a short review of some properties of RKHS relevant for our purpose
in Section 3. Sections 4 and 5 describe respectively how two important
properties of features can be expressed in terms of norms in RKHS, and
Section 6 describes the feature extraction process. Experimental results are
presented in Section 7, followed by a discussion in Section 8.



2 Problem definition 

2.1 Setting and notations 

Before focusing on expression profiles and biochemical pathways, we first
formulate the problem in a more abstract way. The set of genes is
represented by a finite set $X$ of cardinality $|X| = n$, where each element
$x \in X$ represents a gene. The information provided by the microarray
experiments and the pathway database is represented respectively as:

• a mapping $e : X \to \mathbb{R}^p$, where $e(x)$ is the expression profile of the
gene $x$, for any $x$ in $X$, and $p$ is the number of measurements available.
In the sequel we assume that the profiles have been centered, i.e.:

$$\sum_{x \in X} e(x) = 0. \tag{1}$$

• A simple graph $\Gamma = (X, \mathcal{E})$ (without loops and multiple edges) whose
vertices are the genes $X$ and whose edges $\mathcal{E}$ represent the links between
genes, as extracted from the biochemical pathway database.

The notation $x \sim y$ for any $(x, y) \in X^2$ means that there is an edge between
$x$ and $y$, i.e., $\{x, y\} \in \mathcal{E}$. Our goal in the sequel is to use the graph
$\Gamma$ in order to extract features from the expression profiles $e$ relevant for
the functional classification of the genes. In this context we formally define
a feature to be a mapping $f : X \to \mathbb{R}$, and we denote by $\mathcal{F} = \mathbb{R}^X$ the set
of possible features. The set of centered features is denoted by
$\mathcal{F}_0 = \{f \in \mathcal{F} : \sum_{x \in X} f(x) = 0\}$. For any feature $f \in \mathcal{F}$ the same
notation is used to represent the $n$-dimensional vector $f = (f(x))_{x \in X}$
indexed by the elements of $X$, and $f'$ denotes its transpose. The constant
unit vector is denoted $\mathbf{1} = (1, \ldots, 1)$.

2.2 Feature relevance

Features can be derived from the mapping $e$. As an example, projecting $e$ to a
given direction $v \in \mathbb{R}^p$ gives the feature $f_{e,v}$ defined for any $x$ in $X$ by:

$$f_{e,v}(x) = v'e(x). \tag{2}$$

If $v$ represents a particular expression pattern, then $f_{e,v}$ quantifies how
each gene correlates with this pattern. In this paper we restrict ourselves to
such linear features, and denote by $\mathcal{G} = \{f_{e,v}, v \in \mathbb{R}^p\} \subset \mathcal{F}$ the set of
linear features. Observe that by hypothesis (1), each linear feature is also
centered, i.e., $\mathcal{G} \subset \mathcal{F}_0$.

Biological events such as synthesis of new molecules or transport of
substrates usually require the coordinated actions of many proteins. Genes
encoding such proteins are therefore likely to share particular patterns of
expression over different experimental conditions, e.g. simultaneous
overexpression or inhibition. A vector $v \in \mathbb{R}^p$ representing this pattern
should therefore be particularly correlated (positively or negatively) with
the genes participating in the biological process. As a result, linear
features $f_{e,v}$ corresponding to biologically relevant patterns $v \in \mathbb{R}^p$ are
more likely to have a larger variance than those corresponding to patterns
unrelated to any biological event, where the variance is defined by:

$$\forall f_{e,v} \in \mathcal{G}, \quad V(f_{e,v}) = \frac{\sum_{x \in X} f_{e,v}(x)^2}{\|v\|^2}. \tag{3}$$

At the other extreme, a pattern $v \in \mathbb{R}^p$ orthogonal to all profiles leads to
a feature $f_{e,v}$ with null variance, and is clearly unlikely to be related to
any biological process requiring gene expression. It follows that the variance
(3) captured by a feature is a first indicator of its biological pertinence. In
order to prevent confusion with other criteria in the sequel, we will call a
feature relevant if it captures much of the variation between expression
profiles in the sense of (3), and irrelevant otherwise. The reader can observe
that searching for the most relevant features can be done by performing a
principal component analysis (PCA) [Jol96] of the profiles, the first
principal components corresponding to the most relevant features; however we
now show that relevance is not the only criterion which can be used to select
features.
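
The link between relevance and PCA can be checked numerically. The sketch below is only an illustration under stated assumptions: the profile matrix `E` is synthetic, the sizes `n` and `p` are arbitrary, and the variance is taken to be the sum of squared feature values divided by the squared norm of the direction $v$. It is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6                      # hypothetical sizes: n genes, p measurements
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
E = rng.normal(size=(n, p)) * scales
E -= E.mean(axis=0)               # center the profiles: sum_x e(x) = 0


def linear_feature(E, v):
    """Return the vector (v'e(x))_x, i.e. the linear feature f_{e,v}."""
    return E @ v


def variance(E, v):
    """Sum of squared feature values divided by ||v||^2."""
    f = linear_feature(E, v)
    return float(f @ f) / float(v @ v)


# The direction of largest variance is the top eigenvector of E'E,
# which is the first principal direction of the centered profiles.
eigvals, eigvecs = np.linalg.eigh(E.T @ E)
v_top = eigvecs[:, -1]
v_rand = rng.normal(size=p)

print(variance(E, v_top), variance(E, v_rand))
```

With this definition the variance along the first principal direction equals the largest eigenvalue of `E.T @ E`, and no other direction can exceed it.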



2.3 Feature smoothness 



Relevance as defined in Section 2.2 is an intrinsic property of the set of
profiles, as it is defined in terms of variation captured, and no other
information about the relationships between genes is used.

Independently of any microarray experiment, many metabolic pathways have
been experimentally characterized over the years. These collections of chemical
reactions involve proteins as enzymes, whose presence or absence plays a major
role in monitoring the reaction. Actual biological events usually involve
series of such reactions, also called pathways. Genes involved in consecutive
reactions of pathways are likely to share particular patterns of expression,
corresponding to the activation or not of the corresponding pathway.

As a result a pattern $v \in \mathbb{R}^p$ which corresponds to a true biological event,
such as the activation or inhibition of a pathway, is likely to be shared by
clusters of genes in the graph of genes where two genes are linked if they
participate in consecutive reactions. On a more global scale, such a feature is
more likely to vary smoothly on the graph of genes, in the sense that
variations between linked genes are as small as possible, than a noisy pattern
unrelated to any biochemical event, which would not exhibit any particular
correlation between genes linked to each other in the graph.

Such features are called smooth in the sequel, as opposed to rugged features
which vary a lot with respect to the graph topology. These notions are
formalized and quantified in terms of a norm in a Hilbert space in Section 4,
but before developing these technicalities we can already sketch a feature
extraction process based on this intuitive definition.



2.4 Problem formulation 



From the discussions in Sections 2.2 and 2.3 two criteria appear to
characterize "good" candidate features: their relevance on the one hand
(Section 2.2), based on a statistical analysis of the set of profiles, and
their smoothness on the other hand (Section 2.3), which results from the
analysis of the variations of the feature with respect to the topology of the
graph of genes.

Good candidate features are smooth and relevant at the same time. These
two properties are however not always correlated: it might be possible to find
many relevant but rugged features, as well as smooth but irrelevant features. A
reasonable approach to extract meaningful features is therefore to try to find
a compromise between these two criteria, and to extract features which are at
the same time as smooth and as relevant as possible.

Although this statement can be translated mathematically in many different 
ways, we investigate in the sequel the following formulation: 

Problem 1 Extract pairs of features $(f_1, f_2) \in \mathcal{F}_0 \times \mathcal{G}$ such that:

• $f_1$ be smooth,

• $f_2$ be relevant,

• $f_1$ and $f_2$ be correlated.

These three goals are usually contradictory and a trade-off must be found
between them. Observe that if either the smoothness or the relevance condition
is removed, the problem is likely to be ill-posed. For instance, if the
smoothness requirement is removed then any relevant feature $f_2$ is perfectly
correlated with itself; on the other hand if the relevance condition
disappears then many smooth features $f_1$ can probably be correlated with
linear features which are not necessarily relevant (this possibility increases
when the dimension $p$ of the profiles increases, as the set of linear features
increases too).

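Let us now formulate Problem 1 mathematically. The correlation between
any two centered features $(f_1, f_2) \in \mathcal{F}_0^2$ is equal to:

$$c(f_1, f_2) = \frac{f_1' f_2}{\sqrt{f_1' f_1} \sqrt{f_2' f_2}}. \tag{4}$$

As already mentioned the maximization of $c(f_1, f_2)$ over $\mathcal{F}_0 \times \mathcal{G}$ is an
ill-posed problem.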

Suppose we can define a smoothness functional $h_1 : \mathcal{F} \to \mathbb{R}^+$ for any
feature, and a relevance functional $h_2 : \mathcal{G} \to \mathbb{R}^+$ for linear features, in
such a way that lower values of the functional $h_1$ (resp. $h_2$) correspond to
smoother (resp. more relevant) features. Then one way to formalize the
trade-off between correlation and relevance / smoothness is to solve the
following maximization problem:

$$\max_{(f_1, f_2) \in \mathcal{F}_0 \times \mathcal{G}} \frac{f_1' f_2}{\sqrt{f_1' f_1 + \delta h_1(f_1)} \sqrt{f_2' f_2 + \delta h_2(f_2)}}, \tag{5}$$

where $\delta$ is a regularization parameter. When $\delta = 0$ we recover the ill-posed
problem of maximizing the correlation (4), and the larger $\delta$ the smoother
(resp. the more relevant) the feature $f_1$ (resp. $f_2$) which solves (5). As a
result, a solution $(f_1, f_2)$ of (5) is a reasonable solution to Problem 1, with
$\delta$ controlling the trade-off between correlation on the one hand, smoothness
and relevance on the other hand.
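
To make the role of the regularization parameter concrete, the following sketch evaluates a regularized correlation of this form for a few values of the parameter. It is a toy illustration only: the functional `roughness` is a placeholder used for both penalties (the actual smoothness and relevance functionals are only defined later in the paper), and the vectors `f1` and `f2` are synthetic.

```python
import numpy as np


def regularized_correlation(f1, f2, h1, h2, delta):
    """f1'f2 / (sqrt(f1'f1 + delta*h1(f1)) * sqrt(f2'f2 + delta*h2(f2)))."""
    num = float(f1 @ f2)
    den1 = np.sqrt(f1 @ f1 + delta * h1(f1))
    den2 = np.sqrt(f2 @ f2 + delta * h2(f2))
    return num / (den1 * den2)


def roughness(f):
    """Placeholder penalty (an assumption): squared differences along a chain."""
    return float(np.sum(np.diff(f) ** 2))


rng = np.random.default_rng(0)
f1 = rng.normal(size=20)
f1 -= f1.mean()
f2 = f1 + 0.5 * rng.normal(size=20)
f2 -= f2.mean()

# With delta = 0 the plain correlation is recovered; larger values of delta
# shrink the objective unless the penalties are small.
values = [regularized_correlation(f1, f2, roughness, roughness, d)
          for d in (0.0, 0.1, 1.0, 10.0)]
print(values)
```

Because both penalties are non-negative, the magnitude of the objective can only decrease as the parameter grows, which is why pairs of features with small penalties are favoured.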

Problem (5) is therefore the problem we consider in the sequel. In order to
solve it we need to 1) express the smoothness and relevance functionals $h_1$
and $h_2$ mathematically and 2) solve the maximization problem (5) with these
functionals. These two steps are not independent. In particular there is an
incentive to express $h_1$ and $h_2$ mathematically in such a way that (5) be
computationally solvable.

If $f_1$ and $f_2$ were restricted to be linear functionals obtained by projecting
two different vector representations of the genes on particular directions,
then the maximum of (4) would be exactly the first canonical correlation
between $f_1$ and $f_2$ [Hot36], as obtained by classical canonical correlation
analysis (CCA). Linear algebra algorithms involving eigenvector decomposition
exist to perform CCA. However $f_1$ is not restricted to be a linear feature, and
(4) is consequently ill-posed.

Formulated as (5), however, we recover a slight generalization of CCA
introduced in [BJ01] and called kernel-CCA. More precisely, kernel-CCA is
formulated as:

$$\max_{(f_1, f_2) \in \mathcal{H}_1 \times \mathcal{H}_2} \frac{f_1' f_2}{\sqrt{f_1' f_1 + \delta \|f_1\|^2_{\mathcal{H}_1}} \sqrt{f_2' f_2 + \delta \|f_2\|^2_{\mathcal{H}_2}}}, \tag{6}$$

where $\mathcal{H}_1$ and $\mathcal{H}_2$ are two reproducing kernel Hilbert spaces (see Section 3)
on the space $X$. Problem (6) is equivalent to a generalized eigenvalue problem
[BJ01] and can be solved iteratively to extract several pairs of features (see
Section 6).

In order to use the algorithm of [BJ01] we therefore need to restate (5) in
terms of optimization in RKHS like (6). This involves 1) expressing $\mathcal{F}_0$ as a
RKHS whose norm is a smoothness functional (Section 2.3), 2) expressing $\mathcal{G}$ as
a RKHS whose norm is a relevance functional (Section 5), and 3) solving the
resulting problem (6).



3 Reproducing kernel Hilbert space 



Before carrying out the program sketched in Section 2.4 we first recall some
definitions and basic properties of RKHS in order to make this paper as
self-contained as possible. Good introductions to RKHS can be found in
[Aro50, Sai88, Wah90, SS02], from which we borrow most of the material
presented in this section.



3.1 Basic definitions 

Let $X$ be a set (which we don't necessarily assume to be finite in this
section), and $K : X^2 \to \mathbb{R}$ a symmetric positive definite function, in the sense
that for every $l \in \mathbb{N}$ and $(x_1, \ldots, x_l) \in X^l$ the $l \times l$ Gram matrix
$K_{i,j} = K(x_i, x_j)$ is positive semidefinite.

Then it is known that the linear span of the set of functions
$\{K(., x), x \in X\} \subset \mathbb{R}^X$ can be completed into a Hilbert space $\mathcal{H} \subset \mathbb{R}^X$ which
satisfies the following "reproducing property":

$$\forall (f, x) \in \mathcal{H} \times X, \quad f(x) = \langle K(., x), f \rangle_{\mathcal{H}}, \tag{7}$$

where $\langle ., . \rangle_{\mathcal{H}}$ represents the inner product of $\mathcal{H}$. In particular, by
plugging $f = K(., x')$ in (7) we obtain:

$$\forall (x, x') \in X^2, \quad \langle K(., x), K(., x') \rangle_{\mathcal{H}} = K(x, x'). \tag{8}$$

The Hilbert space $\mathcal{H}$ is called a reproducing kernel Hilbert space [Aro50] to
emphasize the property (7). In order to make this rather abstract result
clearer, let us show how the space $\mathcal{H}$ can be built when $X$ is finite, which is
the case of interest in this paper.

Let us therefore take $X$ to be the finite set of genes, and suppose first that
the $n \times n$ Gram matrix $K_{x,y} = K(x, y)$ for any $(x, y) \in X^2$ is positive definite,
i.e., that its eigenvalues are all positive. It can then be diagonalized as
follows:

$$K = \sum_{i=1}^{n} \lambda_i \phi_i \phi_i', \tag{9}$$

where the eigenvalues satisfy $0 < \lambda_1 \le \ldots \le \lambda_n$ and the set
$(\phi_1, \ldots, \phi_n) \in \mathcal{F}^n$ is an associated orthonormal basis of eigenvectors.
We can now take the Hilbert space to be $\mathcal{H} = \mathcal{F}$, and define the inner product
in $\mathcal{H}$ in terms of the decomposition of any $f \in \mathcal{H}$ in the basis of
eigenvectors:

$$f = \sum_{i=1}^{n} a_i \phi_i, \tag{10}$$

as follows:

$$\Big\langle \sum_{i=1}^{n} a_i \phi_i, \sum_{i=1}^{n} b_i \phi_i \Big\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \frac{a_i b_i}{\lambda_i}. \tag{11}$$

It is easy to check that the Hilbert space defined by (11) satisfies the
reproducing property (7), and is therefore a RKHS associated with the kernel
$K(., .)$.
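
The check of the reproducing property can be spelled out in two lines. The derivation below is a sketch added for convenience; it only uses the eigendecomposition of $K$, the expansion of $f$ in the basis of eigenvectors, and the inner product defined from the coefficients divided by the eigenvalues.

```latex
% Coordinates of K(., x) in the eigenbasis, read off the diagonalization of K:
K(\cdot, x) = \sum_{i=1}^{n} \lambda_i \, \phi_i(x) \, \phi_i .
% Inner product with f = \sum_i a_i \phi_i, using \sum_i a_i b_i / \lambda_i:
\langle K(\cdot, x), f \rangle_{\mathcal{H}}
  = \sum_{i=1}^{n} \frac{\lambda_i \, \phi_i(x) \, a_i}{\lambda_i}
  = \sum_{i=1}^{n} a_i \, \phi_i(x)
  = f(x) .
```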

The columns of the Gram matrix being independent, any feature $f \in \mathcal{H}$ can
be uniquely represented as follows:

$$f(.) = \sum_{x \in X} \alpha(x) K(x, .), \tag{12}$$

or in an equivalent matrix form:

$$f = K\alpha. \tag{13}$$

This representation is called the dual representation of $f$, and the vector
$\alpha = (\alpha(x))_{x \in X} \in \mathcal{F}$ is called the dual coordinate of $f$.

The dual representation is useful to express the inner product in the Hilbert
space $\mathcal{H}$. Indeed, using (12) and (8) it is easy to check that the inner
product between two features $(f, g) \in \mathcal{F}^2$ with dual coordinates
$(\alpha, \beta) \in \mathcal{F}^2$ respectively is given by:

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{(x, y) \in X^2} \alpha(x) \beta(y) K(x, y) = \alpha' K \beta.$$

In particular the $\mathcal{H}$-norm of a feature $f \in \mathcal{F}$ with dual coordinates
$\alpha \in \mathcal{F}$ is given by:

$$\|f\|^2_{\mathcal{H}} = \alpha' K \alpha. \tag{14}$$

The inner product in the original space $L^2(X)$ can also simply be expressed
with the dual representation: for any $(f, g) \in \mathcal{F}^2$ with dual coordinates
$(\alpha, \beta)$ respectively we have by (13) and using the fact that $K$ is symmetric:

$$f'g = \sum_{x \in X} f(x) g(x) = \alpha' K^2 \beta.$$

In case the kernel $K$ is just positive semidefinite, with $r$ being the
multiplicity of 0 as eigenvalue, then we can follow the same construction with
the index $i$ ranging from $r + 1$ to $n$ in (9), (10) and (11). In that case the
RKHS $\mathcal{H}$ is the linear span of $\{\phi_{r+1}, \ldots, \phi_n\}$, of dimension $n - r$. The
dual representation still makes sense but is defined up to an element of
$\{\alpha \in \mathbb{R}^X, K\alpha = 0\}$.
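
For the positive definite case, the construction of this section can be verified numerically on a small example. The sketch below is illustrative only: the Gram matrix `K` is an arbitrary positive definite matrix built for the occasion, and `inner_H` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
K = B @ B.T + n * np.eye(n)        # toy positive definite Gram matrix

# Eigendecomposition: K = sum_i lam[i] * Phi[:, i] Phi[:, i]'
lam, Phi = np.linalg.eigh(K)


def inner_H(f, g):
    """RKHS inner product: sum_i a_i b_i / lam_i, with a = Phi'f and b = Phi'g."""
    a, b = Phi.T @ f, Phi.T @ g
    return float(np.sum(a * b / lam))


f = rng.normal(size=n)
g = rng.normal(size=n)

# Reproducing property: f(x) = <K(., x), f>_H for every x.
reproduced = np.array([inner_H(K[:, x], f) for x in range(n)])

# Dual coordinates: f = K alpha and g = K beta.
alpha = np.linalg.solve(K, f)
beta = np.linalg.solve(K, g)

print(np.allclose(reproduced, f))
print(np.isclose(inner_H(f, g), alpha @ K @ beta))
print(np.isclose(f @ g, alpha @ K @ K @ beta))
```

The three checks correspond to the reproducing property, the expression of the RKHS inner product in dual coordinates, and the expression of the ordinary inner product in dual coordinates.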

3.2 RKHS and smoothness functional

One classical application of the theory of RKHS is regularization theory to
solve ill-posed problems [TA77, Iva76, Wah90, GJP95]. Indeed it is well known
that for many choices of kernels $K(., .)$ on continuous spaces $X \subset \mathbb{R}^N$ the norm
$\|f\|_{\mathcal{H}}$ in the corresponding RKHS is intimately related to the smoothness
properties of the functions $f \in \mathcal{H}$.

The following classical example is relevant for us. Consider a set $X \subset \mathbb{R}^N$
and a translation-invariant kernel of the form $K(x, y) = k(x - y)$ for any
$(x, y) \in X^2$. Then the RKHS $\mathcal{H}$ is composed of the functions $f \in L^2(X)$ such
that:

$$\|f\|^2_{\mathcal{H}} = \int \frac{|\hat{f}(\omega)|^2}{\nu(\omega)} \, d\omega < \infty, \tag{15}$$

where $\hat{f}(\omega)$ is the Fourier transform of $f$ and $\nu(\omega)$ is the Fourier transform
of $k(.)$ [GJP95, SSM98]. Functionals of the form (15) are known to be
smoothness functionals (in which case smoothness is defined in terms of Fourier
transform, i.e., smooth functions are functions with little energy at high
frequency), where the rate of decrease to zero of $\nu$ controls the smoothness
properties of the functions in the RKHS. For example, for the Gaussian radial
basis function $k(x - y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ the norm in the RKHS takes the
form:

$$\|f\|^2_{\mathcal{H}} = \frac{1}{(2\pi\sigma^2)^{N/2}} \int e^{\frac{\sigma^2 \|\omega\|^2}{2}} |\hat{f}(\omega)|^2 \, d\omega. \tag{16}$$

Equation (16) shows that the energy of $f$ at a frequency $\omega$ should decrease
at least as $\exp(-\sigma^2 \|\omega\|^2 / 2)$ for its $\mathcal{H}$-norm to be finite. Functions with
much energy at high frequency have a large norm in $\mathcal{H}$, which therefore acts as
a smoothness functional.



We refer the reader to [TA77, Iva76, Wah90, GJP95] for more details on the
connections between RKHS and smoothness functionals, as well as for
applications to solve ill-posed problems. In the sequel we will adapt these
approaches to discrete spaces $X$ in order to fulfill the program sketched in
Section 2.4.



4 Smoothness functional on a graph 

As pointed out in Section 2.4 our interest is now to derive a "smoothness
functional" for features $f \in \mathcal{F}$ with respect to the graph $\Gamma$, expressed as a
norm in a RKHS.

4.1 Fourier transform on graphs 

Equation (15) shows that the norm in a RKHS on a continuous space associated
with a translation-invariant kernel is defined in terms of Fourier transform. A
natural approach to adapt the construction of smoothing functionals to
functions defined on a graph is therefore to adapt the Fourier transform to
that context. As a matter of fact Fourier transforms on graphs have been
extensively studied in spectral graph theory [Chu97, Moh91, Moh97, Sta96], as
we now recall.



Let $D$ be the $n \times n$ diagonal matrix of vertex degrees of the graph $\Gamma$, i.e.,

$$\forall (x, y) \in X^2, \quad D_{x,y} = \begin{cases} 0 & \text{if } x \neq y, \\ \deg(x) & \text{if } x = y, \end{cases}$$

where $\deg(x)$ is the number of edges involving $x$ in $\Gamma$, and let $A$ be the
adjacency matrix defined by:

$$\forall (x, y) \in X^2, \quad A_{x,y} = \begin{cases} 1 & \text{if there is an edge between } x \text{ and } y \text{ in } \Gamma, \\ 0 & \text{otherwise.} \end{cases}$$

Then the $n \times n$ matrix:

$$L = D - A$$

is called the (discrete) Laplacian of $\Gamma$. The discrete Laplacian $L$ is a central
concept in spectral graph analysis [Moh97]. It shares many important
properties with the familiar differential operator

$$-\Delta(.) = -\mathrm{div}(\mathrm{grad}(.))$$

on Riemannian manifolds. It is symmetric, positive semidefinite, and singular.
The eigenvector $(1, \ldots, 1)$ belongs to the eigenvalue $\lambda_1 = 0$, whose
multiplicity is equal to the number of connected components of $\Gamma$.
Let us denote by

$$0 = \lambda_1 \le \ldots \le \lambda_n$$

the eigenvalues of $L$ and $\{\phi_i, i = 1, \ldots, n\}$ an orthonormal set of associated
eigenvectors. Just like the Fourier basis functions are eigenfunctions of the
continuous Laplacian on $\mathbb{R}^N$, the eigenvectors of $L$ can be regarded as a
discrete Fourier basis on the graph $\Gamma$ [Sta96], with frequency increasing with
their eigenvalues.

Although the term "frequency" is not well defined for functions on a graph,
the reader can get an intuition of the fact that the functions
$(\phi_i, i = 1, \ldots, n)$ "oscillate" more and more on the graph as $i$ increases
through the following two well-known results:

• Applying the classical equality [Moh97]

$$\forall f \in \mathcal{F}, \quad f'Lf = \sum_{x \sim y} (f(x) - f(y))^2$$

to an eigenfunction $\phi$ of $L$ with eigenvalue $\lambda$ gives the following equality:

$$\sum_{x \sim y} (\phi(x) - \phi(y))^2 = \lambda. \tag{17}$$

Equation (17) confirms that the larger $\lambda$, the more the associated
eigenfunction varies between adjacent vertices of the graph.

• Another classical result concerns the number of maximal connected
components of the graph where a feature has a constant sign. The first
eigenfunction being constant, it has only one such component, namely
the whole graph. For the other eigenfunctions, the discrete nodal domain
theorem, which translates Courant's famous nodal theorem for elliptic
operators on Riemannian manifolds [Cha84] to the discrete setting
[dV92, Fri93, vdH96, DGL+01], states that the number of maximal
connected subsets of $X$ where $\phi_i$ does not change sign is equal to $i$ in the
case where all eigenvalues have multiplicity 1 (see a more general statement
in [DGL+01]). Together with the fact that each eigenfunction $\phi_i$ for $i > 1$
has zero mean (because it is orthogonal to the constant function $\phi_1$) this
shows that $\phi_i$ "oscillates" more and more on the graph, in the sense that
it changes sign more and more often as $i$ increases.

By analogy with the continuous case the basis $(\phi_i)_{i=1,\ldots,n}$ is called a
Fourier basis, higher eigenvalues corresponding to higher frequencies. Any
feature $f \in \mathcal{F}$ can be expanded in terms of this basis:

$$f = \sum_{i=1}^{n} \hat{f}_i \phi_i, \tag{18}$$

where $\hat{f}_i = \phi_i' f$ and $\hat{f} = (\hat{f}_1, \ldots, \hat{f}_n)'$ is called the discrete Fourier
transform of $f$. This provides a way to analyze features in the frequency
domain, and in particular to measure their smoothness as we now show.
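
The objects of this section are easy to compute on a small example. The sketch below builds the Laplacian of a hypothetical four-vertex graph (the edge list is made up for illustration), checks the quadratic-form identity, and computes a discrete Fourier transform and its inverse. The function names are not from the paper.

```python
import numpy as np

# Hypothetical toy graph: the path 0-1-2-3 plus the extra edge 1-3.
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
n = 4

A = np.zeros((n, n))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0
D = np.diag(A.sum(axis=1))         # vertex degrees on the diagonal
L = D - A                          # discrete Laplacian

# Eigenvalues in increasing order; the columns of Phi form an
# orthonormal Fourier basis.
lam, Phi = np.linalg.eigh(L)

f = np.array([0.3, -1.2, 0.5, 2.0])
quad = float(f @ L @ f)
edge_sum = sum((f[x] - f[y]) ** 2 for x, y in edges)


def edge_variation(phi):
    """Sum over edges of squared differences of a feature."""
    return sum((phi[x] - phi[y]) ** 2 for x, y in edges)


f_hat = Phi.T @ f                  # discrete Fourier transform of f
f_back = Phi @ f_hat               # expansion of f in the Fourier basis

print(quad, edge_sum)
print(lam)
```

For each unit-norm eigenvector the variation across edges equals its eigenvalue, which is the sense in which higher eigenvalues correspond to higher frequencies.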



4.2 Graph smoothness functional 

The Laplacian matrix $L$ is positive semidefinite and can therefore be used as
a kernel Gram matrix. The multiplicity $r$ of 0 as eigenvalue is the number of
connected components of the graph, and the associated eigenvectors are the
functions constant on each connected component. Following Section 3.1 the
associated RKHS $\mathcal{H}$ has dimension $n - r$ and is made of the set of features with
zero mean on each connected component. By (11) the norm of any function
$f \in \mathcal{H}$ is given by:

$$\|f\|^2_{\mathcal{H}} = \sum_{i=r+1}^{n} \frac{\hat{f}_i^2}{\lambda_i}, \tag{19}$$

where $\hat{f}$ is the Fourier transform of $f$ (18) and $(\lambda_i)$ is the ordered set of
eigenvalues of $L$.

However, as shown in Section 4.1 the smoothness of $\phi_i$ decreases with $i$.
Because $\lambda_i$ increases with $i$, the norm (19) in the RKHS associated with the
kernel $L$ increases with smoothness, and is therefore a "ruggedness functional"
instead of a smoothness functional in the sense defined in Section 3.2. To
illustrate this we can observe that:

$$\forall i \in \{r + 1, \ldots, n\}, \quad \|\phi_i\|^2_{\mathcal{H}} = \frac{1}{\lambda_i},$$

hence decreases with $i$.

Transforming this ruggedness functional into a smoothness functional can
be performed by a simple operation on the kernel as follows:

Definition 1 For any decreasing mapping $\zeta : \mathbb{R}^+ \to \mathbb{R}^+ \setminus \{0\}$, we define the
$\zeta$-kernel $K_\zeta : X^2 \to \mathbb{R}$ by:

$$\forall (x, y) \in X^2, \quad K_\zeta(x, y) = \sum_{i=1}^{n} \zeta(\lambda_i) \phi_i(x) \phi_i(y),$$

where $0 = \lambda_1 \le \ldots \le \lambda_n$ are the eigenvalues of the graph Laplacian and
$(\phi_1, \ldots, \phi_n)$ an associated orthonormal Fourier basis.

The mapping $\zeta$ being assumed to take only positive values, the matrix $K_\zeta$ is
positive definite and is therefore a valid kernel, with associated RKHS
$\mathcal{H}_\zeta = \mathcal{F}$. From the discussion above it is now clear that:

Proposition 1 The norm $\|.\|_{\mathcal{H}_\zeta}$ in the RKHS associated with the kernel $K_\zeta$ is
a smoothing functional, given for any feature $f \in \mathcal{F}$ with Fourier transform
$\hat{f} \in \mathbb{R}^n$ by:

$$\|f\|^2_{\mathcal{H}_\zeta} = \sum_{i=1}^{n} \frac{\hat{f}_i^2}{\zeta(\lambda_i)}. \tag{20}$$

Proof Equation (20) is a direct consequence of Definition 1 and (11). The fact
that $\|.\|_{\mathcal{H}_\zeta}$ is a smoothing functional is simply a translation of the fact that
$\zeta(\lambda_i)$ decreases with $i$, hence the relative contribution of the Fourier
components in (20) increases with their frequency.



Proposition 1 shows that the smoothness functional associated with a function
$\zeta$ is controlled by its rate of decrease to 0. An example of a valid $\zeta$
function with rapid decay is the following:

$$\forall x \in \mathbb{R}^+, \quad \zeta(x) = e^{-\tau x}, \tag{21}$$

where $\tau$ is a parameter. In that case we recover the diffusion kernel introduced
and discussed in [KL02]. The authors of that paper show that the diffusion
kernel shares many properties with the continuous Gaussian kernel
$K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ on $\mathbb{R}^N$, and can therefore be considered as its
discrete version.

Combining (20) and (21) we obtain that the norm in the RKHS associated
with the diffusion kernel is given by:

$$\forall f \in \mathcal{F}, \quad \|f\|^2_{\mathcal{H}} = \sum_{i=1}^{n} e^{\tau \lambda_i} \hat{f}_i^2, \tag{22}$$

hence the high frequency energy of $f$ is strongly penalized by this kernel, and
the penalization increases with the parameter $\tau$.

Before continuing we should observe that in concrete applications the
computation of the kernel for a given $\zeta$ can be performed by diagonalizing the
Laplacian matrix as:

$$L = \Phi' \Lambda \Phi,$$

where $\Lambda$ is a diagonal matrix with diagonal elements $\Lambda_{i,i} = \lambda_i$, and computing:

$$K_\zeta = \Phi' \zeta(\Lambda) \Phi,$$

where $\zeta(\Lambda)$ is a diagonal matrix with diagonal elements $\zeta(\Lambda)_{i,i} = \zeta(\lambda_i)$. We
can also observe that the diffusion kernel can be written using the matrix
exponential as:

$$K_\zeta = e^{-\tau L}.$$

Although other choices of $\zeta$ lead to other kernels, discussing them would be
beyond the scope of this paper so we will restrict ourselves to using the
diffusion kernel as a smoothing functional in the sequel. The conclusion of
this section is that by using the diffusion kernel we can build a RKHS
$\mathcal{H} = \mathcal{F}$ whose norm $\|.\|_{\mathcal{H}}$ is a smoothness functional.
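
The computation by diagonalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the path graph, the value of `tau` and the helper names `graph_laplacian`, `zeta_kernel` and `diffusion_kernel` are all assumptions made for the example. Note that `numpy.linalg.eigh` returns eigenvectors as columns, so the product is written with the transpose on the right.

```python
import numpy as np


def graph_laplacian(n, edges):
    """L = D - A for an undirected simple graph given as an edge list."""
    A = np.zeros((n, n))
    for x, y in edges:
        A[x, y] = A[y, x] = 1.0
    return np.diag(A.sum(axis=1)) - A


def zeta_kernel(L, zeta):
    """K = sum_i zeta(lam_i) phi_i phi_i', from the eigendecomposition of L."""
    lam, Phi = np.linalg.eigh(L)       # eigenvectors are the COLUMNS of Phi
    return (Phi * zeta(lam)) @ Phi.T   # scales column i by zeta(lam[i])


def diffusion_kernel(L, tau):
    """Diffusion kernel exp(-tau L), obtained with zeta(x) = exp(-tau x)."""
    return zeta_kernel(L, lambda lam: np.exp(-tau * lam))


L = graph_laplacian(5, [(0, 1), (1, 2), (2, 3), (3, 4)])   # toy path graph
tau = 0.7
K = diffusion_kernel(L, tau)

# RKHS norm of a feature, computed in dual coordinates and in Fourier domain.
lam, Phi = np.linalg.eigh(L)
f = np.array([1.0, 0.5, -0.2, -1.0, 2.0])
f_hat = Phi.T @ f
norm_sq_dual = float(f @ np.linalg.solve(K, f))    # alpha'K alpha, f = K alpha
norm_sq_fourier = float(np.sum(np.exp(tau * lam) * f_hat ** 2))

print(norm_sq_dual, norm_sq_fourier)
```

Two sanity checks follow from the matrix exponential form: applying the kernel twice gives the kernel with parameter `2 * tau`, and every row sums to one because the constant vector is in the null space of the Laplacian.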



5 Relevance functional 

Let us now consider the problem of defining a relevance functional. First
observe that any direction $v \in \mathbb{R}^p$ with orthogonal projection $v_0$ on the linear
span of $\{e(x), x \in X\}$ satisfies $f_{e,v} = f_{e,v_0}$. As a result the search for
linear features $f_{e,v}$ can be restricted to directions belonging to this linear
span, which can be parametrized as:

$$v = \sum_{x \in X} \beta(x) e(x), \tag{23}$$

where $\beta \in \mathcal{F}$ is called the dual coordinate of $v$ (defined up to an element of
$\{\beta \in \mathcal{F}, K\beta = 0\}$).

The positive semidefinite Gram matrix $K_{x,y} = e(x)'e(y)$, singular due to the
centering of profiles (1), defines a RKHS $\mathcal{H} \subset \mathcal{F}$ which consists of features of
the form:

$$f(.) = \sum_{x \in X} \gamma(x) K(x, .) = \sum_{x \in X} \gamma(x) e(x)'e(.) = \Big( \sum_{x \in X} \gamma(x) e(x) \Big)' e(.),$$

where $\gamma \in \mathcal{F}$. Equation (23) shows that $\mathcal{H}$ is exactly the set of linear features
$\mathcal{G}$, and by (14) the semi-norm of $\mathcal{H}$ is given by:

$$\forall f_{e,v} \in \mathcal{G}, \quad \|f_{e,v}\|^2_{\mathcal{H}} = \beta' K \beta, \tag{24}$$

where $\beta$ is the dual coordinate of $v$ defined by (23).

On the other hand, combining (3), (2) and (23) shows that the variance of
a feature $f_{e,v} \in \mathcal{G}$ can be expressed in terms of the dual coordinate $\beta$ of $v$ as
follows:

$$V(f_{e,v}) = \frac{f_{e,v}' f_{e,v}}{\|v\|^2} = \frac{\beta' K^2 \beta}{\beta' K \beta}.$$

From this we see that the larger the ratio between $\beta' K^2 \beta$ and $\beta' K \beta$ the more
relevant the feature $f_{e,v}$, where $v$ has dual coordinates $\beta$. By observing that
$f_{e,v} = K\beta$ and therefore $f_{e,v}' f_{e,v} = \beta' K^2 \beta$, and by (24), we see that a natural
relevance functional to plug into (5) in order to counterbalance the effect of
$f_2' f_2$ is the following:

$$h_2(f_{e,v}) = \beta' K \beta = \|f_{e,v}\|^2_{\mathcal{H}}. \tag{25}$$

Indeed the larger $h_2(f_{e,v})$ compared to $f_{e,v}' f_{e,v}$ the smaller $V(f_{e,v})$, and
therefore the less variation is captured by $f_{e,v}$. The functional (25) is
defined on $\mathcal{G}$ as the norm of a RKHS, which was the goal assigned in Section 2.4.
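
The identities used in this section can be checked on synthetic data. The sketch below assumes a random centered profile matrix `E` and a random dual coordinate `beta`; the variance is taken to be the sum of squared feature values divided by the squared norm of the direction. None of the variable names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3                        # hypothetical sizes: n genes, p measurements
E = rng.normal(size=(n, p))
E -= E.mean(axis=0)                # centered profiles
K = E @ E.T                        # linear kernel: K[x, y] = e(x)'e(y)

beta = rng.normal(size=n)          # dual coordinate of the direction v
v = E.T @ beta                     # v = sum_x beta(x) e(x)
f = E @ v                          # linear feature f_{e,v}

norm_v_sq = float(v @ v)                     # ||v||^2
h2 = float(beta @ K @ beta)                  # candidate relevance functional
V_primal = float(f @ f) / norm_v_sq          # variance from the definition
V_dual = float(beta @ K @ K @ beta) / h2     # same quantity in dual coordinates

print(V_primal, V_dual)
```

The checks confirm that the feature equals `K @ beta`, that the squared norm of the direction equals `beta @ K @ beta`, and that the kernel is singular because the profiles are centered.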



6 Extracting smooth correlations 
6.1 Dual formulation 

Let us now put together the elements we have developed up to now. In Section
[I] we have shown that any feature $f \in \mathcal{F}$ can be represented as:

$$f = K_1 \alpha,$$

where $K_1$ is the diffusion kernel Gram matrix derived from the Laplacian matrix
$L$ by $K_1 = \exp(-\tau L)$, and $\alpha$ is the dual coordinate vector of $f$ in the
corresponding RKHS $\mathcal{H}_1 = \mathcal{F}$. Moreover, we defined a smoothness functional
as:

$$\forall f \in \mathcal{F}, \quad h_1(f) = \|f\|_{\mathcal{H}_1}^2 = \alpha' K_1 \alpha.$$

In Section 5 we showed that every linear feature $f_{e,v} \in \mathcal{G}$ can also be
represented in a dual form:

$$f_{e,v} = K_2 \beta,$$

where $K_2$ is the kernel Gram matrix $K_2(x,y) = e(x)'e(y)$ for any $(x,y) \in X^2$
and $\beta$ is the dual coordinate vector in the corresponding degenerate RKHS
$\mathcal{H}_2 = \mathcal{G}$. Moreover a relevance functional was defined as:

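$$\forall v \in \mathbb{R}^p, \quad h_2(f_{e,v}) = \|f_{e,v}\|_{\mathcal{H}_2}^2 = \beta' K_2 \beta.$$

Plugging these results into (||) leads to the following formulation of the initial
problem in terms of dual coordinates:

$$\max_{(\alpha,\beta) \in \mathcal{F}^2} \gamma(\alpha,\beta) \qquad (26)$$

with

$$\gamma(\alpha,\beta) = \frac{\alpha' K_1 K_2 \beta}{\left( \alpha' (K_1^2 + \delta K_1) \alpha \right)^{1/2} \left( \beta' (K_2^2 + \delta K_2) \beta \right)^{1/2}}. \qquad (27)$$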
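As an illustration of the objective (27), the sketch below evaluates $\gamma(\alpha,\beta)$ for given Gram matrices. The function name `gamma` and the random toy matrices are assumptions made for this example only.

```python
import numpy as np

def gamma(alpha, beta, K1, K2, delta):
    """Regularized correlation of equation (27) in dual coordinates."""
    num = alpha @ K1 @ K2 @ beta
    den1 = np.sqrt(alpha @ (K1 @ K1 + delta * K1) @ alpha)
    den2 = np.sqrt(beta @ (K2 @ K2 + delta * K2) @ beta)
    return num / (den1 * den2)

# Toy positive semidefinite matrices standing in for K1 and K2.
rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, 3))
K1_toy, K2_toy = A @ A.T, B @ B.T
a, b = rng.normal(size=n), rng.normal(size=n)

g_plain = gamma(a, b, K1_toy, K2_toy, delta=0.0)   # plain correlation of K1 a and K2 b
g_reg = gamma(a, b, K1_toy, K2_toy, delta=0.5)     # shrunk by the regularization
```

With $\delta = 0$ the value is the ordinary correlation between the features $K_1\alpha$ and $K_2\beta$, so it lies in $[-1, 1]$; a positive $\delta$ can only enlarge the denominators. The value is also unchanged when $\alpha$ or $\beta$ is rescaled by a positive constant.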

Observe that this is the dual formulation of (|^) except that the optimization
is done in $\mathcal{F} \times \mathcal{G}$ instead of $\mathcal{F}_0 \times \mathcal{G}$. Moreover, in order to keep the interpretation
of $\|\cdot\|_{\mathcal{H}_1}$ as a smoothing functional the kernel $K_1$ should not be centered in
the feature space, as in usual kernel CCA [BJ01] and kernel PCA [SSM99].
As the following Proposition shows, this is however not a problem because the
features whose dual coordinates maximize (26) are centered anyway, and the
optimization for $f_1 \in \mathcal{F}$ is therefore equivalent to the maximization for $f_1 \in \mathcal{F}_0$.

Proposition 2 For any $(\alpha, \beta) \in \mathcal{F}^2$, let $\alpha_0$ be the dual coordinate of the
centered version of $f = K_1 \alpha$, i.e.:

$$\exists \epsilon \in \mathbb{R}, \quad K_1 \alpha_0 = K_1 \alpha + \epsilon \mathbf{1}, \qquad \sum_{x \in X} (K_1 \alpha_0)(x) = 0.$$

Then the following holds:

$$\gamma(\alpha_0, \beta) \geq \gamma(\alpha, \beta),$$

with equality if and only if $\alpha = \alpha_0$. In particular, the features whose dual
coordinates $\alpha$ and $\beta$ solve (26) are centered.



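Proof. Because the profiles $\{e(x), x \in X\}$ are supposed to be centered we have
$K_2 \mathbf{1} = 0$, and therefore:

$$\alpha_0' K_1 K_2 \beta = (\alpha' K_1 + \epsilon \mathbf{1}') K_2 \beta = \alpha' K_1 K_2 \beta.$$

Let $(\phi_1, \ldots, \phi_n)$ denote an orthonormal Fourier basis, where $\phi_1$ is constant.
Then any feature $f = K_1 \alpha$ is centered by removing the contribution of $\phi_1$ in its
Fourier expansion, i.e.,

$$f_0 = K_1 \alpha_0 = \sum_{i=2}^{n} \hat{f}_i \phi_i.$$

As a result we obtain from ([Tl]), writing $\lambda_i$ for the eigenvalue of $L$ associated
with $\phi_i$:

$$\begin{aligned}
\alpha' K_1 \alpha &= \|f\|_{\mathcal{H}_1}^2 \\
&= \sum_{i=1}^{n} e^{\tau \lambda_i} \hat{f}_i^2 \\
&\geq \sum_{i=2}^{n} e^{\tau \lambda_i} \hat{f}_i^2 \\
&= \|K_1 \alpha_0\|_{\mathcal{H}_1}^2 \\
&= \alpha_0' K_1 \alpha_0,
\end{aligned}$$

where the inequality on the third line is an equality if and only if $\hat{f}_1 = 0$, i.e., $f$
is centered. Moreover, using the Pythagorean equality in $L^2(X)$ for the orthogonal
vectors $\mathbf{1}$ and $K_1 \alpha_0$ we easily get:

$$\begin{aligned}
\alpha' K_1^2 \alpha &= \|f\|_{L^2(X)}^2 \\
&= \|K_1 \alpha_0 - \epsilon \mathbf{1}\|_{L^2(X)}^2 \\
&= \|K_1 \alpha_0\|_{L^2(X)}^2 + \|\epsilon \mathbf{1}\|_{L^2(X)}^2 \\
&\geq \|K_1 \alpha_0\|_{L^2(X)}^2 \\
&= \alpha_0' K_1^2 \alpha_0.
\end{aligned}$$

Combining these inequalities with the definition of $\gamma$ (26) proves the Proposition.
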
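The statement of Proposition 2 can be illustrated numerically. The sketch below is a toy check and not part of the proof: it assumes a path graph on five nodes, the Laplacian taken as degree matrix minus adjacency matrix, and random centered profiles. The sign of $\beta$ is flipped when needed so that the correlation term $\alpha' K_1 K_2 \beta$ is non-negative, which is the situation of interest for a maximizer.

```python
import numpy as np

def gamma(alpha, beta, K1, K2, delta):
    """Regularized correlation of equation (27)."""
    num = alpha @ K1 @ K2 @ beta
    d1 = np.sqrt(alpha @ (K1 @ K1 + delta * K1) @ alpha)
    d2 = np.sqrt(beta @ (K2 @ K2 + delta * K2) @ beta)
    return num / (d1 * d2)

# Toy path graph; Laplacian assumed to be D - A.
n, tau, delta = 5, 1.0, 0.1
A = np.diag(np.ones(n - 1), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)
K1 = (U * np.exp(-tau * lam)) @ U.T       # diffusion kernel exp(-tau L)

rng = np.random.default_rng(1)
E = rng.normal(size=(n, 3))
E -= E.mean(axis=0)                       # centered profiles, hence K2 1 = 0
K2 = E @ E.T

alpha = rng.normal(size=n)
beta = rng.normal(size=n)
if alpha @ K1 @ K2 @ beta < 0:            # keep the numerator non-negative
    beta = -beta

f = K1 @ alpha
alpha0 = np.linalg.solve(K1, f - f.mean())  # dual coordinate of the centered feature
g = gamma(alpha, beta, K1, K2, delta)
g0 = gamma(alpha0, beta, K1, K2, delta)
```

Here `g0` is expected to be at least as large as `g`, since centering leaves the numerator unchanged and can only decrease both factors of the denominator.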
6.2 Features extraction 

Stated as (26) the problem is similar to the kernel canonical correlation problem
studied in [BJ01]. In particular, by differentiating with respect to $\alpha$ and $\beta$ we see
that $(\alpha, \beta)$ is a solution of (26) if and only if it satisfies the following generalized
eigenvalue problem:

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_1^2 + \delta K_1 & 0 \\ 0 & K_2^2 + \delta K_2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \qquad (28)$$

with $\rho$ the largest possible. The reader is referred to [BJ01] for details about
the derivation of (28). Let $\bar{n} = \min(n, p)$. As pointed out in this paper solving
(28) provides a series of pairs of features:

$$\{(\alpha_i, \beta_i),\ i = 1, \ldots, \bar{n}\}$$

with decreasing values of $\gamma(\alpha_i, \beta_i)$ for which the gradient $\nabla_{\alpha,\beta}\gamma$ is null,
equivalent to the extraction of successive canonical directions with decreasing
correlation in classical CCA. The resulting features $f_{1,i} = K_1 \alpha_i$ and $f_{2,i} = K_2 \beta_i$ are
therefore a set of features likely to have decreasing biological relevance when $i$
increases, and are the features we propose to extract in this paper.

The classical way to solve a generalized eigenvalue problem $B\rho = \lambda C\rho$ is
to perform a Cholesky decomposition of $C$ as $C = E'E$, to define $\mu = E\rho$ and
to solve the standard eigenvector problem $(E')^{-1} B E^{-1} \mu = \lambda \mu$. However the
matrix $K_2^2 + \delta K_2$ is singular so it must be regularized for this approach to be
numerically stable. Following [BJ01] this can be done by adding $\delta^2/4$ on the
diagonal, and observing that:

$$K^2 + \delta K + \frac{\delta^2}{4} I = \left( K + \frac{\delta}{2} I \right)^2$$

leads to the following regularized problem:

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} (K_1 + \delta' I)^2 & 0 \\ 0 & (K_2 + \delta' I)^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \qquad (29)$$

where $\delta' = \delta/2$. If $(\alpha, \beta)$ is a generalized eigenvector solution of (29) belonging
to the generalized eigenvalue $\rho$, then $(-\alpha, \beta)$ belongs to $-\rho$. As a result the
spectrum of (29) is symmetric: $(\rho_1, -\rho_1, \ldots, \rho_n, -\rho_n)$ with $\rho_1 \geq \ldots \geq \rho_n$ and
$\rho_i = 0$ for $i > p$.
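One possible way to solve (29) numerically is sketched below. It is not the implementation used for the experiments: instead of a Cholesky factor it uses the symmetric factor $E = \mathrm{diag}(K_1 + \delta' I, K_2 + \delta' I)$, which is available directly because the right-hand side of (29) is a completed square. The function name and the toy data are assumptions.

```python
import numpy as np

def solve_regularized_kcca(K1, K2, delta):
    """Solve problem (29); returns eigenvalues (decreasing) and dual coordinates."""
    n = K1.shape[0]
    d = delta / 2.0                          # delta' = delta / 2
    R1 = K1 + d * np.eye(n)                  # symmetric positive definite factors
    R2 = K2 + d * np.eye(n)
    # With mu = (R1 alpha, R2 beta), (29) becomes M mu = rho mu, where
    # M = [[0, T], [T', 0]] and T = R1^{-1} K1 K2 R2^{-1}.
    T = np.linalg.solve(R1, K1 @ K2) @ np.linalg.inv(R2)
    Z = np.zeros((n, n))
    M = np.block([[Z, T], [T.T, Z]])
    rho, mu = np.linalg.eigh(M)
    order = np.argsort(rho)[::-1]            # largest eigenvalue first
    rho, mu = rho[order], mu[:, order]
    alphas = np.linalg.solve(R1, mu[:n, :])  # column i is alpha_i
    betas = np.linalg.solve(R2, mu[n:, :])   # column i is beta_i
    return rho, alphas, betas

# Toy example with random positive semidefinite Gram matrices.
rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
K1 = A @ A.T
E = rng.normal(size=(n, 2))
E -= E.mean(axis=0)
K2 = E @ E.T
rho, alphas, betas = solve_regularized_kcca(K1, K2, delta=0.1)
```

The eigenvalues returned by this sketch come in pairs of opposite sign, in line with the symmetry of the spectrum noted above, and only the first columns of `alphas` and `betas` (those with positive $\rho$) are of interest.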



6.3 Feature extraction process 

Solving (29) results in two sets of features $\{K_1 \alpha_i, i = 1, \ldots, \bar{n}\}$ and
$\{K_2 \beta_i, i = 1, \ldots, \bar{n}\}$. Features of the form $K_1 \alpha$ are computed from the position of the
genes in the gene graph, while features of the form $K_2 \beta$ are computed from the
expression profiles.

In concrete applications, the position of a still uncharacterized gene in the
gene graph is not known, while its expression profile can be measured. As a
result the only way to extract features for such a gene is to use the features
$\{K_2 \beta_i, i = 1, \ldots, \bar{n}\}$. These features are obtained by projecting the expression
profiles on the respective directions:

$$v_i = \sum_{x \in X} \beta_i(x)\, e(x), \quad i = 1, \ldots, \bar{n}. \qquad (30)$$

Therefore features can be extracted from any expression profile $e$ by projections
on these directions. We can now summarize a typical use of the feature
extraction process presented in this paper as follows:

• The set of genes $X$ is supposed to be the disjoint union of two subsets $X_1$
and $X_2$. Expression profiles are measured for all genes, but only genes in
$X_1$ are present in the gene network $G = (X_1, \mathcal{E})$. Hence $X_1$ is the set of
genes which have been assigned a precise role in a pathway, while $X_2$ is
the set of uncharacterized genes.

• Use the set $X_1$ to extract features from the set of expression profiles
$\{e(x), x \in X_1\}$ using the graph $G$, by solving (29).

• Derive a set of expression patterns by (30).

• Extract features from the expression profiles $\{e(x), x \in X_2\}$ by projecting
them on the derived expression patterns.

This process provides a way to replace the expression profile of an uncharacterized
gene by a vector of features which hopefully are more biologically
relevant than the raw profile itself. Any data mining algorithm, e.g.
clustering or functional classification methods, can then be applied to this new
representation.
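The last two steps of the process above amount to two matrix products. The sketch below illustrates them under stated assumptions: profiles are stored as rows of a matrix, the dual coordinates $\beta_i$ are stored as columns, and the random `betas` merely stand in for the solutions of (29). The function names are hypothetical.

```python
import numpy as np

def expression_patterns(E_train, betas):
    """Equation (30): column i is v_i = sum_x beta_i(x) e(x)."""
    return E_train.T @ betas          # shape (p, number of features)

def extract_features(E_any, patterns):
    """Project each profile (one per row) on every expression pattern."""
    return E_any @ patterns           # shape (number of genes, number of features)

rng = np.random.default_rng(0)
E1 = rng.normal(size=(8, 4))          # toy profiles of the genes in X_1
E1 -= E1.mean(axis=0)
E2 = rng.normal(size=(3, 4))          # toy profiles of the genes in X_2
betas = rng.normal(size=(8, 2))       # placeholder for two solutions of (29)

V = expression_patterns(E1, betas)
features_X2 = extract_features(E2, V)
```

Applied to the genes of $X_1$ themselves, the projection reproduces the features $K_2 \beta_i$, which is why the same patterns can be reused for genes outside the graph.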



7 Experiments 

In order to evaluate the relevance of the pathway-driven feature extraction
process presented in this paper we performed functional classification experiments
with the genes of the yeast Saccharomyces cerevisiae. The main goal of these
experiments is to test whether a state-of-the-art classifier, namely a support
vector machine, performs better when working directly with the expression profiles
of the genes or when using the vectors of features.



7.1 Pathway data 

The LIGAND database of chemical compounds and reactions in biological
pathways [GOH+02, GNK98] is part of the Kyoto Encyclopedia of Genes and Genomes
(KEGG) [KGKN02, Kan97]. As of February 2002 it consists of a curated set
of 3579 metabolic reactions known to take place in some organisms, together
with the substrates involved and the classification of the catalyzing enzyme as
an EC number. To each reaction are associated one or several EC numbers, and
to each EC number are associated one or several genes of the yeast genome.
Using this information we created a graph of genes by linking two genes
whenever they were assigned two EC numbers known to catalyze two reactions which
share a common main compound (secondary compounds such as water or ATP
are discarded).

In other words two genes are linked in the resulting graph if they have the
possibility to catalyze two successive reactions, the main product of the first
one being the main substrate of the second one. It is far from certain that
all the candidate genes for a given reaction (candidates because they
are assigned an EC number supposed to represent a family of potential enzymes
catalyzing the reaction) actually catalyze it in the cell. These data nevertheless
provide a global picture of the possible relationships between genes in terms
of catalyzing properties. In particular a path in this graph corresponds to a
possible series of reactions catalyzed by the successive genes met along the path.

The resulting graph involves 774 genes of S. cerevisiae, linked with 16,650
edges.
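The linking rule can be sketched as follows. The reactions, EC numbers and gene names below are made up for illustration and do not come from LIGAND; the sketch also assumes that only pairs of distinct reactions are compared, and that secondary compounds have already been removed from the sets of main compounds.

```python
from itertools import combinations

# Hypothetical toy data: each reaction lists its EC numbers and main compounds.
reactions = {
    "R1": {"ec": ["1.1.1.1"], "main": {"A", "B"}},
    "R2": {"ec": ["2.2.2.2"], "main": {"B", "C"}},
    "R3": {"ec": ["3.3.3.3"], "main": {"D", "E"}},
}
ec_to_genes = {"1.1.1.1": ["g1", "g2"], "2.2.2.2": ["g3"], "3.3.3.3": ["g4"]}

def build_gene_graph(reactions, ec_to_genes):
    """Link two genes when their reactions share a main compound."""
    genes_of = {
        name: {g for ec in info["ec"] for g in ec_to_genes.get(ec, [])}
        for name, info in reactions.items()
    }
    edges = set()
    for r1, r2 in combinations(reactions, 2):
        if reactions[r1]["main"] & reactions[r2]["main"]:
            for g1 in genes_of[r1]:
                for g2 in genes_of[r2]:
                    if g1 != g2:
                        edges.add(frozenset((g1, g2)))   # undirected edge
    return edges

edges = build_gene_graph(reactions, ec_to_genes)
```

In this toy example only R1 and R2 share a main compound (B), so the graph contains the two edges g1-g3 and g2-g3, and g4 remains isolated.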
7.2 Microarray data

Publicly available microarray expression data were collected from the Stanford
Microarray Database [SHBK+01]. The data include yeast response to various
experimental conditions, including metabolic shift from fermentation to
respiration [DIB97], alpha-factor block release, cdc15 block release, elutriation time
course, cyclin over-expression [SSZ+98], sporulation [CDE+98], adaptive
evolution [FBBR98], stress response [GSK+00], manipulation of phosphate level
[ODB00], cell cycle [ZSV+00], growth conditions of excess copper or copper
deficiency [GKI+00], DNA damage response [GHM+01], and transfer from a
fermentable to a nonfermentable carbon source [KDBS01].

Combining these data results in 330 data points available for 6075 genes,
i.e., almost all known or predicted genes of S. cerevisiae. Each data point
produced by a DNA microarray hybridization experiment represents the ratio
of expression levels of a particular gene under two experimental conditions.
Following [ESBB98, BGL+00] we don't work directly with this ratio but rather
with its normalized logarithm defined as:

$$\forall (x,i) \in X \times \{1, \ldots, 330\}, \quad e(x)_i = \frac{\log\left( E_{x,i} / R_{x,i} \right)}{\sqrt{\sum_{j} \log^2\left( E_{x,j} / R_{x,j} \right)}},$$

where $E_{x,i}$ is the expression level of gene $x$ in experiment $i$ and $R_{x,i}$ is the
expression level in the corresponding reference state. Missing values were estimated
with the software KNNimpute [TCS+01].
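For a single gene, the normalized logarithm can be computed as in the sketch below. It assumes that the denominator is the Euclidean norm of the vector of log-ratios, as in [BGL+00]; the function name and the input values are illustrative.

```python
import math

def normalized_log_ratios(expression, reference):
    """Log-ratios of one gene over all experiments, scaled to unit Euclidean norm."""
    logs = [math.log(e / r) for e, r in zip(expression, reference)]
    norm = math.sqrt(sum(value * value for value in logs))
    return [value / norm for value in logs]

# Toy gene measured in three experiments against a reference level of 1.
profile = normalized_log_ratios([2.0, 1.0, 0.5], [1.0, 1.0, 1.0])
```

Because of the normalization the base of the logarithm has no influence on the result: a two-fold induction and a two-fold repression give the values $1/\sqrt{2}$ and $-1/\sqrt{2}$ in this example.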



7.3 Functional classes 

The January 10, 2002, version of the functional classification catalogue of the
Comprehensive Yeast Genome Database (CYGD) [MFG+02] is a comprehensive
classification of 3936 yeast genes into 259 functional classes organized in
a hierarchy. The classes vary in size between 1 and 2258 genes (for the class
"subcellular localization"), and not all of them are supposed to be correlated
with gene expression [BGL+00]. Only classes with at least 20 genes (after
removing the genes present in the gene graph, see next Section) are considered
as benchmark datasets for function prediction algorithms in the sequel, which
amounts to 115 categories.



7.4 Gene function prediction 



Following the general approach presented in Section 6.3 the gene function prediction
experiment involves two steps:

• The 669 genes in the gene graph derived from the pathway database with
known expression profiles are used to perform the feature extraction
process by solving (30).

• The resulting linear features are extracted from the expression profiles of
the disjoint set of 2688 genes which are in the CYGD functional
catalogue but not in the pathway database. Systematic evaluation of the
performance of support vector machines to predict each CYGD class either
from the expression profiles themselves [BGL+00] or from the features
extracted is then performed on this set of genes using 3-fold cross-validation
averaged over 10 iterations.



Support vector machines (SVM) [Vap98, CST00, SS02] are a class of machine
learning algorithms for supervised classification which have been shown to
perform better than other machine learning techniques, including Fisher's linear
discriminant, Parzen windows and decision trees, on the problem of gene
functional classification from expression profiles [BGL+00]. We therefore use SVM
as a state-of-the-art learning algorithm to assess the gain resulting from
replacing the original expression profiles by vectors of features.

Experiments were carried out with SVM Light [Joa99], a public and free
implementation of SVMs. To ensure a comparison as fair as possible between
different data representations, all vectors were scaled to unit length before being
sent to the SVM, and all SVMs used a radial basis kernel with unit width, i.e.,
$k(x,y) = \exp(-\|x - y\|^2)$. The trade-off parameter between training error and
margin was set to its default value (namely 1 in the case where all vectors have
unit length), and the cost factor by which training errors on positive examples
outweigh errors on negative examples was set equal to the ratio of the number
of positive examples and the number of negative examples in the training set.
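The preprocessing and kernel described above can be written compactly. The sketch below is not SVM Light code; it only illustrates, with hypothetical function names and toy vectors, the scaling to unit length and the radial basis kernel with unit width.

```python
import numpy as np

def to_unit_length(X):
    """Scale every row vector to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def rbf_kernel(X, Y):
    """Radial basis kernel k(x, y) = exp(-||x - y||^2) between rows of X and Y."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq_dist, 0.0))   # clip tiny negative rounding errors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))        # four toy vectors (profiles or feature vectors)
U = to_unit_length(X)
K = rbf_kernel(U, U)
```

For unit-length vectors the squared distance equals $2 - 2x'y$, so every kernel value lies between $e^{-4}$ and 1 whatever the representation, which is one way to read the scaling as a fairness device.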

We compared the performance of SVM working directly on the expression
profiles, as in [BGL+00], with SVM working on the vectors of features extracted
by the procedure described in this paper, for various choices of the regularization
parameter $\delta$, of the width $\tau$ of the diffusion kernel and of the number of features selected.

For each experiment the performance is measured by the ROC index, defined
as the area under the ROC curve, i.e., the plot of true positives versus false
positives, and normalized to 100 for a perfect classifier. The ROC curve itself
is obtained by varying a threshold and classifying genes by comparing the score
output by the SVM with this threshold. A random classifier has an average
ROC index of 50.
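The ROC index can be computed without drawing the curve, using the standard fact that the area under the ROC curve equals the fraction of (positive, negative) pairs ranked in the right order, ties counting for one half. The sketch below relies on this equivalence; the function name and the toy scores are illustrative.

```python
def roc_index(scores, labels):
    """Area under the ROC curve, scaled to 100 for a perfect classifier."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return 100.0 * wins / (len(pos) * len(neg))

perfect = roc_index([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # positives all on top
mixed = roc_index([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])     # half of the pairs ordered
```

In these two toy cases the index is 100 for the perfectly ranked list and 50 for the list in which only two of the four positive-negative pairs are correctly ordered.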



7.5 Setting the parameters 

Our feature extraction process contains two free parameters, namely the width
$\tau$ of the diffusion kernel and the regularization parameter $\delta$. Intuitively, the
larger $\tau$ and $\delta$, the smoother and more relevant the features extracted, at the
expense of a decrease in their correlation. As pointed out in [BJ01] the
parameter $\delta$ is expected to decrease linearly with $n$, and a reasonable value is
$\delta = 0.001$ for $n$ of the order of 1000. An initial value of $\tau = 1$ was chosen.

We varied $\delta$ and $\tau$ independently in order to check their influence. For
a fixed $\delta = 0.001$ we tested the performance of SVM based on the features
extracted with the parameter $\tau \in \{0.5, 1, 2, 5\}$, where all 330 features are used.

Table 1: Performance comparison for various $\tau$

$\delta$    $\tau$    Average ROC    Percentage of classes best predicted
0.001       0.5       61.4           37
0.001       1         61.4           35
0.001       2         60.0           20
0.001       5         55.2           8



Table 2: Performance comparison for various $\delta$

$\delta$    $\tau$    Average ROC    Percentage of classes best predicted
0.0005      1         61.4           17
0.001       1         61.4           18
0.002       1         61.4           25
0.005       1         61.6           39



Table 1 shows the ROC index averaged over all 115 classes with more than
20 genes for each of the four SVMs, as well as the percentage of classes best
predicted by each method. The best performance is reached for $\tau = 1$, with
an important deterioration when $\tau$ increases to 5. A larger $\tau$ means by ( p2|)
that rugged features are more strongly penalized, so larger values of $\tau$ tend to generate
smoother features. The deterioration when $\tau$ increases shows the importance of
not excessively penalizing ruggedness.
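The effect of $\tau$ on the penalization of rugged features can be observed on a small example. The sketch below assumes a path graph on four nodes and the Laplacian taken as degree matrix minus adjacency matrix, and computes the smoothness functional as $\|f\|_{\mathcal{H}_1}^2 = f' K_1^{-1} f$, which follows from $f = K_1 \alpha$ and $h_1(f) = \alpha' K_1 \alpha$. Function names are hypothetical.

```python
import numpy as np

def diffusion_kernel(adjacency, tau):
    """K1 = exp(-tau L), computed from the eigendecomposition of L = D - A."""
    L = np.diag(adjacency.sum(axis=1)) - adjacency
    lam, U = np.linalg.eigh(L)
    return (U * np.exp(-tau * lam)) @ U.T

def smoothness(f, K1):
    """RKHS norm ||f||^2 = f' K1^{-1} f; large values mean a rugged feature."""
    return f @ np.linalg.solve(K1, f)

A = np.diag(np.ones(3), 1)
A = A + A.T                                   # path graph 1 - 2 - 3 - 4
constant = np.ones(4)                         # perfectly smooth feature
rugged = np.array([1.0, -1.0, 1.0, -1.0])     # changes sign along every edge

penalties = [smoothness(rugged, diffusion_kernel(A, tau)) for tau in (0.5, 1.0, 5.0)]
flat = smoothness(constant, diffusion_kernel(A, 5.0))
```

The penalty of the rugged feature grows quickly with $\tau$, while the constant feature keeps the same norm for every $\tau$, since the constant vector is left unchanged by the diffusion kernel.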

We also checked the influence of the regularization parameter $\delta$, which
controls the trade-off between correlation on the one hand, smoothness and
relevance on the other hand. Table 2 compares the performances of SVM based on
the features extracted with the parameters $\tau = 1$ and $\delta \in \{0.0005, 0.001, 0.002, 0.005\}$.
This shows a small (in terms of ROC index increase) but consistent (in terms
of number of classes best predicted) increase in performance when $\delta$ increases
from 0.0005 to 0.005. This illustrates the importance of regularization, and
therefore the improvement gained by imposing some smoothness and relevance
constraints on the features.



7.6 Number of features 

From now on we fix the parameters to $\tau = 1$ and $\delta = 0.001$. As the feature
extraction process is supposed to extract up to $p = 330$ features by decreasing
biological relevance, one might ask if classification performance could increase
by only keeping the most relevant features, and hopefully removing noise by
discarding the remaining ones. To check this we measured the performance of
SVM using an increasing number of features. Results are shown in Table 3, and
show that it is on average preferable to use all features, as the performance
increases with the number of features used. Exceptions to this average principle
include classes such as fermentation, ionic homeostasis, assembly of protein
complexes, vacuolar transport, phosphate metabolism or nucleus organization,
which are better predicted with fewer than 100 features as shown in Figure 1.

Table 3: Performance comparison for various numbers of features, with $\delta = 0.001$ and $\tau = 1$

Number of features    Average ROC    Percentage of classes best predicted
50                    55.3           3
100                   57.9           10
150                   58.9           9
200                   59.9           7
250                   60.6           17
300                   61.2           17
330                   61.4           37




Figure 1: Classification performance for various classes

Figure 2: Comparison of the classification performance of SVM based on
expression profiles (y axis) or extracted features (x axis). Each point represents
one functional class.



7.7 Functional classification performance 

In order to check whether the feature extraction provides any advantage over
the direct use of expression profiles for gene function prediction we finally
compared the performance of a SVM using all features extracted with the parameters
$\delta = 0.001$ and $\tau = 1$, with the performance of a SVM using directly the gene
expression profiles. Figure 2 shows the ROC index obtained by each of the
two methods for all 115 functional classes. Except for a few classes, there is a
clear improvement in classification performance when the genes are represented
as vectors of features, and not directly as expression profiles.

Table 4 shows that the ROC index averaged over all classes increases
significantly between the two representations (from 54.9 to 61.2). Moreover Figure
2 shows that most of the classes seem almost impossible to learn from their
expression profiles only (when the ROC index is around 45-55, i.e. not better
than a random classifier), but can somehow be learned from their vectors of
features, as the ROC index jumps to the range 55-65 for many of those classes.
Some classes exhibit a dramatic increase in ROC index, as shown in Table 5,
which lists the classes with the largest absolute increase in ROC index between the two
experiments.

Table 4: ROC index averaged over 115 functional classes by SVM using different
representations of the data

Data representation    Average ROC
Expression profiles    54.6
Vector of features     61.4



Table 5: ROC index for the prediction of categories based on expression profiles
or feature vectors. The categories listed are the ones which exhibit the largest
increase in ROC index between these two representations.

Class                                             Expression    Features    Increase
Heavy metal ion transporters (Cu, Fe, etc.)       55.2          83.5        +28.3
Ribosome biogenesis                               70.9          94.6        +23.7
Protein synthesis                                 61.6          84.3        +22.7
Directional cell growth (morphogenesis)           44.3          64.7        +20.4
Regulation of nitrogen and sulphur utilization    49.0          68.6        +19.6
Nitrogen and sulfur metabolism                    44.3          63.8        +19.5
Translation                                       50.7          69.8        +19.1
Cytoplasm                                         55.0          73.4        +18.4
Endoplasmic reticulum                             59.5          77.0        +17.5
Amino acid transport                              58.3          75.1        +16.8



8 Discussion and conclusion 

This paper proposes an algorithm to extract features from gene expression
profiles based on the knowledge of a biochemical network linking a subset of genes.
Based on the simple idea that relevant features are likely to exhibit
correlation with respect to the topology of the network, we end up with a formulation
which involves encoding the network and the set of expression profiles into two
kernel functions, and performing a regularized canonical correlation analysis in
the corresponding reproducing kernel Hilbert spaces.

Results presented in Section 7 are encouraging and confirm the intuition that
incorporating valuable information, such as the knowledge of the precise position
of many genes in a biochemical network, helps extract relevant information
from expression profiles. While this problem has so far attracted relatively little
attention because the amount of expression data has always been small compared
to the number of genes until recently, it is expected to become more and more
important as the production of expression data becomes cheaper and the underlying
technology more widespread.

A detailed analysis of the experimental results reveals that functional
categories related to metabolism, protein synthesis and subcellular localization
benefit the most from the representation of genes as vectors of features. In the case
of metabolism and protein synthesis related categories, this can be explained by
the fact that many pathways related to these processes are present in the pathway
database, so relevant features have probably been extracted. The case of
subcellular localization proteins is more surprising, as they seem to be more related
to structural properties than functional properties of the genes, but it certainly
reflects the functional role of the organelles themselves. As an example a
sudden need of energy might promote the activity in mitochondria and require the
synthesis of proteins to be directed to this location, even though they might not
be directly involved as enzymes.

From a technical point of view the approach developed in this paper can
be seen as an attempt to encode various types of information about genes into
kernels. The diffusion kernel $K_1$ encodes the gene network, and the linear
kernel $K_2$ summarizes the expression profiles. Recent research shows that this
approach can in fact be generalized to many other sources of information about
genes, as many kernels have been engineered and continue to be developed for
particular types of data. Apart from classical kernels for finite-dimensional
real-valued vectors [Vap98] which can be used to encode any vectorial gene
representation, e.g. expression profiles, and from diffusion kernels which can
encode any gene network, e.g. networks derived from biochemical pathways or
protein interaction networks, relevant examples of recently developed kernels
include the Fisher kernel to encode how the amino-acid sequence of a protein is
related to a given hidden Markov model [JDH00] or to encode the arrangement of
transcription factor binding site motifs in its promoter region [PWCG01], several
string kernels to encode the information present in the amino-acid sequence
itself [Hau99, Wat00, LEN02, Ver02a, LSST+02], or a tree kernel to encode
the phylogenetic profile of a protein [Ver02b]. This increasing list suggests a
unified framework to represent various types of information, which is obtained
by "kernelizing the proteome", i.e., transforming any type of information into an
adequate kernel.

Parallel to the emergence of new kernels, recent years have witnessed the
development of new methods, globally referred to as kernel methods, to perform
various data mining algorithms from the knowledge of the kernel matrix only.
Apart from the most famous support vector machine algorithm for classification
and regression [BGV92, Vap98], other kernel methods include principal
component analysis [SSM99], clustering [BHHSV01], Fisher discriminants [MRW+99]
or independent component analysis [BJ01].

These recent developments open the door to new analysis opportunities 
which we believe can be particularly suited to the new discipline of proteomics 
whose central concepts, genes or proteins, are defined through a variety of dif- 
ferent points of view (as sequences, structures, expression patterns, position in 
networks, ...), the integration of which promises to unravel some of the secrets 
of life. 